ABSTRACT

In many real-world networks, nodes have class labels,
attributes, or variables that affect the network's topology. If
the topology of the network is known but the labels of the 
nodes are hidden, we would like to select a small subset of 
nodes such that, if we knew their labels, we could accu- 
rately predict the labels of all the other nodes. We develop 
an active learning algorithm for this problem which uses 
information-theoretic techniques to choose which nodes to 
explore. We test our algorithm on networks from three dif- 
ferent domains: a social network, a network of English words 
that appear adjacently in a novel, and a marine food web. 
Our algorithm makes no initial assumptions about how the 
groups connect, and performs well even when faced with 
quite general types of network structure. In particular, we 
do not assume that nodes of the same class are more likely 
to be connected to each other — only that they connect to 
the rest of the network in similar ways. 

Categories and Subject Descriptors: 
I.2.6 [Artificial Intelligence]: Learning
G.2.2 [Discrete Mathematics]: Graph theory 

General Terms: Algorithms, Experimentation, Theory 

Keywords: complex networks, structure and function, com- 
munity detection, information theory, active learning, collec- 
tive classification, transductive graph labeling 

1. INTRODUCTION 

In many social, biological, and technological networks, 
nodes have underlying attributes or variables that are
correlated with the network's topology. Blogs tend to link to
other blogs with similar political views [1]. In vertebrate
food webs, predators tend to eat prey whose mass is smaller, 
but not too much smaller, than their own [TT]. Networks 
of word adjacencies are correlated with those words' parts 
of speech [30]. In the Internet, different types of service 
providers form different kinds of links based on their capac- 
ities and business relationships [S] [13] — and so on. 

There has been a great deal of work on efficient algorithms
for community detection in networks (see [12, 32] for
reviews). However, most of this work defines a "community"
as a group of nodes with high density of connections within 
the group and a low density of connections to the rest of the 
network. While this type of assortative community struc- 
ture is common in social networks, we are interested in a 
more general definition of functional community — a group 
of nodes that connect to the rest of the network in similar 
ways. A set of predators might form a functional group in a 
food web, not because they eat each other, but because they 
eat similar prey. In English, nouns often follow adjectives, 
but seldom follow other nouns. Even some social networks 
have disassortative structure where pairs of nodes are more 
likely to be connected if they are from different classes. For 
example, some human societies are divided into moieties, 
and only allow marriages between different moieties [21].

We consider a setting where the topology of the network 
is known, but the class labels of the nodes are not. This 
could be the case, for instance, if we have a network of 
blogs and hyperlinks between them (like citations, trackbacks,
blogrolls, etc.) and we are trying to classify the blogs
according to their political leanings. Another possible ap- 
plication is in online social networks, where friendships are 
known and we are trying to infer hidden demographic vari- 
ables. This problem is sometimes referred to as collective 
classification [35]. However, in that work the focus is on 
classification of individual nodes. In contrast, our focus is 
on the discovery of functional communities in the network, 
and our underlying generative model is designed around the
assumption that these communities exist.

We make no initial assumptions about the structure of the 
network — for instance, whether its groups are assortative, 
disassortative, or some mixture of the two. We assume that 
we can learn the label of any given node, but at a cost, say
in terms of work in the field or laboratory. Our goal is to 
identify a small subset of nodes such that, once we explore 
them and learn their labels, we can accurately predict the 
labels of all the others. 

We present a general approach to this problem. Our algo- 
rithm uses information-theoretic measures to decide which 
node to explore next — that is, which one will give us the 
most information about the rest of the network. We start 
with a probabilistic generative model of the network, called 
a stochastic block model [20, 38], in which groups connect
to each other according to a matrix of probabilities. This 
model allows an arbitrary mixture of assortative and disas- 
sortative structure, as well as directed links from one group 
to another, and has been used to model networks in many 
fields (e.g. [l[ll[33]). 

We stress, however, that our approach could be applied 
equally well to many other probabilistic models, such as 
those where nodes belong to a mixture of classes, a
hierarchy of classes and subclasses [TD], locations in a latent
geographical or social space [T7], or niches in a food web [39]. It
could also be applied to degree-corrected block models such 
as those in [23, 26, 31], which treat the nodes' degrees as
parameters rather than data to be predicted. 

At each stage of the learning process, some of the nodes' 
labels are already known and we need to decide which node 
to explore next. We do this by estimating, for each node, the 
mutual information between its label and the joint distribu- 
tion of all the others' labels, conditioned on the labels of the 
nodes that are known so far. We obtain this estimate by 
Gibbs sampling, giving each classification of nodes a prob- 
ability integrated over the parameters of the block model. 
We then explore the node for which this mutual information 
is largest. 

A key fact about the mutual information, which we ar- 
gue is essential to our algorithm's performance, is that it 
is not just a measure of uncertainty: it is a combination of 
uncertainty about a node's label and the extent to which it 
is correlated with the labels of other nodes. Thus the algo- 
rithm explores nodes which maximize the expected amount 
of information it will gain about the entire network. It skips 
nodes whose labels seem obvious to it, or which are uncer- 
tain but have little effect on other nodes. In an assortative 
network, for instance, it starts by exploring nodes which are 
central to their communities, and then explores nodes along 
the boundaries between them, without being told in advance 
to pursue this strategy. 

We also present an alternate approach which maximizes a 
quantity we call the average agreement. For each node v, this 
is the average number of nodes at which two independent 
samples of the Gibbs distribution agree, conditioned on the 
event that they agree at v. Like mutual information, average 
agreement is high for nodes that are highly correlated with 
the rest of the network. A similar idea (but not applied to 
networks) is present in [34].

We test our algorithm on three real-world networks: the 
social network of a karate club, a network of common adja- 
cent words in a Charles Dickens novel, and a marine food 
web of species in the Antarctic. Each of these networks is 
curated in the sense that we possess the correct node labels, 
such as the faction of the social network each individual be- 
longs to, the part of speech of each word, or the part of 
the habitat each species lives in. We judge our algorithm
according to how accurately it predicts the labels of the
unexplored nodes, as a function of the number of nodes it has
explored so far. We also compare our algorithm with several 
simple heuristics, such as exploring nodes based on their de- 
gree or betweenness centrality, and find that it significantly 
outperforms them. 



2. RELATED WORK 

The idea of designing experiments by maximizing the mu- 
tual information between the variable we learn next and the 
joint distribution of the other variables, or equivalently the 
expected amount of information we gain about the joint
distribution, has a long history in statistics, artificial
intelligence, and machine learning, e.g. MacKay [25] and Guo and
Greiner [16]. Indeed, it goes back to the work of Lindley [24] 
in the 1950s. However, to our knowledge this is the first time 
it has been coupled with a generative model to discover hid- 
den variables in networks. 

In recent work, Zhu, Lafferty, and Ghahramani [31] study 
active learning of node labels using Gaussian fields and har- 
monic functions defined using the graph Laplacian. How- 
ever, this technique only applies to networks where neigh- 
boring nodes are likely to be in the same class — that is, 
networks with assortative community structure. In contrast, 
our techniques are capable of learning about much more gen- 
eral types of network structure, including disassortative and 
directed relationships between functional communities. 

Another approach to active learning of node labels is found 
in the work of Bilgic and Getoor [6] and Bilgic, Mihalkova,
and Getoor [7], who use collective vector-based classifiers.
By properly defining the collective relationships between
nodes, both assortative and disassortative communities can
be learned in this framework. However, our technique dif- 
fers from theirs by using mutual information as the active 
learning criterion, which takes into account not just uncer- 
tainty, but correlations as well. 

Additional works by Goldberg, Zhu, and Wright [14] and 
Tong and Jin [36] also perform semi-supervised learning on 
graphs, and handle the disassortative case. But they work in 
a setting where they know, for each link, if the ends should 
have the same or different labels, such as if one writer quotes 
another with pejorative words. In contrast, we work in a set- 
ting where we have no such information: only the topology 
is available to us, and there are no signs on the edges telling 
us whether we should propagate similar or dissimilar labels. 



3. MODEL AND METHODS 

We represent our network as a directed graph G = (V, E) 
with n nodes. We assume that there are k classes of nodes, 
so that each node v has a class label $t(v) \in \{1, \ldots, k\}$. We
are given the graph G, and our goal is to learn the labels $t(v)$.
To do this, we assume that G is generated by a probabilistic 
model, in which its topology is correlated with these labels. 

The simplest such model, although by no means the only 
one to which our methods could be applied, is a stochastic 
block model [20, 38]. It assumes that for each pair of nodes
u, v, there is an edge from u to v with a probability $p_{t(u),t(v)}$
that depends only on their labels, and that these events are
independent. Given a classification, i.e., a function
$t : V \to \{1, \ldots, k\}$ assigning a label to each node, the probability of



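generating a given graph G in this model is

\[
P(G \mid t, p) = \prod_{i,j=1}^{k} p_{ij}^{e_{ij}} \, (1 - p_{ij})^{n_i n_j - e_{ij}} . \tag{1}
\]
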
Here $n_i = |\{v : t(v) = i\}|$ is the number of nodes of class
i, and $e_{ij} = |\{(u, v) \in E : t(u) = i, t(v) = j\}|$ is the number
of edges from nodes of class i to nodes of class j. If we wish
to focus on undirected graphs, we can modify this expression
by restricting the product to pairs of classes with $i \le j$.
We can also forbid self-loops, if we wish, by replacing $n_i n_j$ in
the term $i = j$ with $n_i(n_i - 1)$ or $\binom{n_i}{2}$ in the directed or
undirected case respectively.
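
As an illustration, the block-model likelihood described above can be evaluated directly from the class sizes and the edge counts between classes. The following is a minimal sketch, not the implementation used in our experiments; it assumes the directed case with self-loops allowed, labels encoded as integers 0, ..., k-1, and hypothetical function names `block_counts` and `log_likelihood`.

```python
import math

def block_counts(edges, labels, k):
    """Return (n, e): n[i] is the number of nodes with label i and
    e[i][j] is the number of directed edges from class i to class j.
    Labels are assumed to be integers 0..k-1 (an encoding chosen here)."""
    n = [0] * k
    for lab in labels.values():
        n[lab] += 1
    e = [[0] * k for _ in range(k)]
    for u, v in edges:
        e[labels[u]][labels[v]] += 1
    return n, e

def log_likelihood(edges, labels, p):
    """log P(G | t, p) for the directed block model with self-loops allowed.
    p[i][j] is the probability of an edge from class i to class j."""
    k = len(p)
    n, e = block_counts(edges, labels, k)
    total = 0.0
    for i in range(k):
        for j in range(k):
            pairs = n[i] * n[j]
            # Skip zero-count terms so that p = 0 or p = 1 is tolerated
            # whenever it is consistent with the observed counts.
            if e[i][j] > 0:
                total += e[i][j] * math.log(p[i][j])
            if pairs - e[i][j] > 0:
                total += (pairs - e[i][j]) * math.log(1.0 - p[i][j])
    return total
```

Working with logarithms avoids numerical underflow, since the likelihood itself is a product of many small factors.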

This kind of stochastic block model is well-known in the 
machine learning, statistics, and network communities O
[37, 15, 18, 19, 33] and has also been used in ecology to
identify groups of species in food webs [4]. Unlike e.g.
[37, 18, 19], we do not assume that $p_{ij}$ takes one value when
$i = j$ and a smaller value when $i \ne j$. In other words, we
do not assume an assortative community structure, where
nodes are more likely to be connected to other nodes of the
same class. Nor do we require in general that $p_{ij} = p_{ji}$,
since the directed nature of the edges may be important,
for instance, in a food web or word adjacency network.

If all classifications t are equally likely a priori, then Bayes' 
rule implies that the Gibbs distribution on the classifica- 
tions, i.e., the probability of t given G, is proportional to 
the probability of G given t: 



\[
P(t \mid G) \propto P(G \mid t) . \tag{2}
\]



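In order to define $P(G \mid t)$, we need to integrate $P(G \mid t, p)$
over some prior probability distribution on p. If we assume
that the $p_{ij}$ are independent, then this integral factors over
the product (1). In particular, if each $p_{ij}$ follows a beta
prior with hyperparameters $\alpha$ and $\beta$, we have the Bayesian
estimate of edge probabilities

\[
P(G \mid t) = \prod_{i,j=1}^{k} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, \frac{\Gamma(e_{ij} + \alpha)\,\Gamma(n_i n_j - e_{ij} + \beta)}{\Gamma(n_i n_j + \alpha + \beta)} . \tag{3}
\]
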
For reasonable choices of the hyperparameters $\alpha$ and $\beta$,
the prior dominates only in small data cases, such as very 
small networks or sparsely populated classes. For such small 
data cases, the beta prior allows the user to input some 
domain knowledge about, say, the (dis)assortativity of the 
target network's community structure. In the limit of large 
data, the prior will wash out and the data-driven community 
structure will dominate. 



If the user wishes to remain agnostic, however, he or she 
can specify a uniform prior ($\alpha = \beta = 1$) and allow the
learning algorithm to estimate the degree of assortativity, 
disassortativity, directedness, and so on entirely from the 
data. We take this approach in this paper, in which case 



\[
P(G \mid t) = \prod_{i,j=1}^{k} \frac{1}{(n_i n_j + 1) \binom{n_i n_j}{e_{ij}}} . \tag{4}
\]
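
To make the uniform-prior case concrete, the integrated likelihood (4) can be computed from the counts alone, since each factor is the integral of $p^{e}(1-p)^{N-e}$ over [0, 1], which equals $e!\,(N-e)!/(N+1)!$ with $N = n_i n_j$. The sketch below is illustrative only; the function name `log_marginal_likelihood` and the list-of-lists layout of the counts are assumptions made here.

```python
import math

def log_marginal_likelihood(n, e):
    """log P(G | t) under a uniform prior on every p_ij (directed case,
    self-loops allowed): the sum over (i, j) of
    log( e_ij! (N - e_ij)! / (N + 1)! ) with N = n_i * n_j."""
    k = len(n)
    total = 0.0
    for i in range(k):
        for j in range(k):
            N = n[i] * n[j]
            # lgamma(x + 1) = log(x!), which stays finite for large counts
            total += (math.lgamma(e[i][j] + 1)
                      + math.lgamma(N - e[i][j] + 1)
                      - math.lgamma(N + 2))
    return total
```

For example, with class sizes n = [2, 1] and two edges from class 0 to class 1, the four factors are 1/5, 1/3, 1/3, and 1/2, so that P(G | t) = 1/90.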



An even simpler approach is to assume that the $p_{ij}$ take
their maximum likelihood values

\[
\hat{p}_{ij} = \arg\max_{p} P(G \mid t, p) = \frac{e_{ij}}{n_i n_j} \tag{5}
\]



and set $P(G \mid t) = P(G \mid t, \hat{p})$. This approach was used, for
instance, for a hierarchical block model in [TD]. When k is
fixed and the $n_i$ are large, this will give results similar to (4),
since the integral over p is tightly peaked around $\hat{p}$.
However, for any particular finite graph it makes more sense, at
least to a Bayesian, to integrate over the $p_{ij}$, since they obey
a posterior distribution rather than taking a fixed value.
Moreover, averaging over the parameters as in (4)
discourages overfitting, since the average likelihood goes down when
we increase k and hence the volume of the parameter space.
This gives us a principled way to determine k automatically,
although in this paper we set k by hand. Other methods
to determine k include minimum description length (MDL)
techniques [33] and the Akaike information criterion [3].

4. ACTIVE LEARNING 

In the active learning setting, the algorithm can learn the 
class label of any given node, but at a cost — say, by devoting 
resources in the laboratory or the field. Since these resources 
are limited, it has to decide which node to explore. Its goal 
is to explore a small set of nodes and use their labels to guess 
the labels of the remaining nodes. 

One natural approach is to explore the node v with the
largest mutual information (MI) between its label $t(v)$ and
the labels $t(G \setminus v)$ of the other nodes according to the Gibbs
distribution (2). We can write this as the difference between
the entropy of $t(G \setminus v)$ and its conditional entropy given $t(v)$,

\[
\mathrm{MI}(v) = I(v; G \setminus v) = H(G \setminus v) - H(G \setminus v \mid v) . \tag{6}
\]

Here $H(G \setminus v \mid v)$ is the entropy, averaged over $t(v)$ according
to the marginal of $t(v)$ in the Gibbs distribution, of the joint
distribution of $t(G \setminus v)$ conditioned on $t(v)$. In other words,
$\mathrm{MI}(v)$ is the expected amount of information we will gain
about $t(G \setminus v)$, or equivalently the expected decrease in the
entropy, that will result from learning $t(v)$.

Since the mutual information is symmetric, we also have 



\[
\mathrm{MI}(v) = I(v; G \setminus v) = H(v) - H(v \mid G \setminus v) , \tag{7}
\]

where $H(v)$ is the entropy of the marginal distribution of
$t(v)$, and $H(v \mid G \setminus v)$ is the entropy, on average, of the
distribution of $t(v)$ conditioned on the labels of the other nodes.
Thus $\mathrm{MI}(v)$ is large if (i) we are uncertain about v, so that
$H(v)$ is large, and (ii) v is strongly correlated with the other
nodes, so that $H(v \mid G \setminus v)$ is small.

We estimate these entropies by sampling from the space of
classifications t according to the Gibbs distribution.
Specifically, we use a single-site heat-bath Markov chain. At each
step, it chooses a node v uniformly from among the
unexplored nodes, and chooses its label $t(v)$ according to the
conditional distribution proportional to $P(G \mid t)$, assuming
that the labels of all other nodes stay fixed. In addition to
exploring the space, this allows us to collect a sample of the
conditional distribution of the chosen node v and its entropy.
Since $H(v \mid G \setminus v)$ is the average of the conditional entropy,
and since $H(v)$ is the entropy of the average conditional
distribution, we can write



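\[
I(v; G \setminus v) = -\sum_{i} \langle P_i \rangle \ln \langle P_i \rangle + \Big\langle \sum_{i} P_i \ln P_i \Big\rangle , \tag{8}
\]

where $P_i$ is the probability that $t(v) = i$ and $\langle \cdot \rangle$ denotes the
average, according to the Gibbs distribution, over the labels
of the other nodes.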
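
The estimate (8) is straightforward to compute once we have collected, for a given node, a list of conditional distributions from the sampler. The following sketch is one way to do this under stated assumptions: each sample is a length-k probability vector, entropies are measured in nats, and the names `entropy` and `mutual_information_estimate` are introduced here for illustration.

```python
import math

def entropy(dist):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def mutual_information_estimate(conditionals):
    """Estimate MI for one node from sampled conditional distributions.

    `conditionals` is a list of length-k probability vectors, each being the
    conditional distribution of t(v) given one sampled labeling of the other
    nodes. The estimate is the entropy of the average distribution minus the
    average of the entropies."""
    m = len(conditionals)
    k = len(conditionals[0])
    mean = [sum(c[i] for c in conditionals) / m for i in range(k)]
    avg_entropy = sum(entropy(c) for c in conditionals) / m
    return entropy(mean) - avg_entropy
```

Two extreme cases illustrate the behavior. If the conditional distribution is always uniform, the node is uncertain but uncorrelated with the rest, and the estimate is zero. If it is always concentrated on one label, but on different labels in different samples, the estimate equals the full marginal entropy.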

We offer no theoretical guarantees about the mixing time 
of this Markov chain, and it is easy to see that there are 
families of graphs and values of k for which it takes
exponential time. However, for the real-world networks we
have tried so far, it appears to converge to equilibrium in 
a reasonable amount of time. We test for equilibrium by 
measuring whether the marginals change noticeably when 
the number of updates is increased by a factor of 2. We im- 
prove our estimates by averaging over many runs, each one 
starting from an independently random initial state. 
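
A single update of the heat-bath chain described above can be sketched as follows. This is a deliberately naive illustration under several assumptions: it uses the uniform-prior integrated likelihood, recomputes the full likelihood for each candidate label rather than updating the counts incrementally, encodes labels as integers 0, ..., k-1, and uses hypothetical function names.

```python
import math
import random

def log_marginal(edges, labels, k):
    """Uniform-prior integrated log-likelihood log P(G | t)
    (directed case, self-loops allowed)."""
    n = [0] * k
    for lab in labels.values():
        n[lab] += 1
    e = [[0] * k for _ in range(k)]
    for u, v in edges:
        e[labels[u]][labels[v]] += 1
    total = 0.0
    for i in range(k):
        for j in range(k):
            N = n[i] * n[j]
            total += (math.lgamma(e[i][j] + 1) + math.lgamma(N - e[i][j] + 1)
                      - math.lgamma(N + 2))
    return total

def conditional_distribution(edges, labels, v, k):
    """Distribution of t(v) given G and the current labels of all other nodes."""
    old = labels[v]
    logs = []
    for i in range(k):
        labels[v] = i
        logs.append(log_marginal(edges, labels, k))
    labels[v] = old
    top = max(logs)  # subtract the maximum before exponentiating
    weights = [math.exp(x - top) for x in logs]
    z = sum(weights)
    return [w / z for w in weights]

def heat_bath_step(edges, labels, unexplored, k, rng):
    """Pick an unexplored node uniformly at random and resample its label
    from its conditional distribution. Returns the node and that
    distribution, which can be recorded for the entropy estimates."""
    v = rng.choice(unexplored)
    dist = conditional_distribution(edges, labels, v, k)
    labels[v] = rng.choices(range(k), weights=dist)[0]
    return v, dist
```

Because only unexplored nodes are ever resampled, the labels of explored nodes stay fixed, so the chain samples from the Gibbs distribution conditioned on what is already known.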

We say that the algorithm is in stage j if it has already 
explored j nodes. In that stage, it estimates $\mathrm{MI}(v)$ for each
unexplored node v, using the Markov chain to sample from
the Gibbs distribution conditioned on the labels of the nodes
explored so far. It then explores the node v with the largest
MI. We provide it with the correct value of $t(v)$ from the
curated network, and it moves on to the next stage.
curated network, and it moves on to the next stage. 
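
The stage-by-stage procedure amounts to a greedy loop, which might be organized as in the sketch below. The callbacks `score` and `oracle` are hypothetical placeholders: `score(v, known)` stands for whatever estimate of the selection criterion is used, given the labels known so far, and `oracle(v)` stands for the costly step of learning the true label of v.

```python
def active_learning(nodes, score, oracle, budget):
    """Greedy exploration: at each stage, query the unexplored node with
    the largest score given the labels known so far.

    Returns the dictionary of learned labels and the exploration order."""
    known = {}
    order = []
    for _ in range(budget):
        unexplored = [v for v in nodes if v not in known]
        if not unexplored:
            break
        best = max(unexplored, key=lambda v: score(v, known))
        known[best] = oracle(best)  # the only place true labels are revealed
        order.append(best)
    return known, order
```

Keeping the criterion behind a callback means that the same loop serves for any node-selection rule, which makes comparisons between rules straightforward.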

The mutual information is not the only quantity we might 
use to identify which node to explore. Another is the average 
agreement, which we define as follows. Given two
classifications $t_1, t_2$, define their agreement as the number of nodes
on whose labels they agree,

\[
|t_1 \cap t_2| = |\{v : t_1(v) = t_2(v)\}| . \tag{9}
\]



Since our goal is to label as many nodes correctly as
possible, we wish we could maximize the agreement between
a classification $t_1$, drawn from the Gibbs distribution, and
the correct classification $t_2$. However, the algorithm doesn't
know $t_2$, so it assumes that it is drawn from the Gibbs
distribution as well. Exploring v projects onto the part of the
joint distribution of $(t_1, t_2)$ where $t_1(v) = t_2(v)$. So, we
define $\mathrm{AA}(v)$ as the expected agreement between two
classifications $t_1, t_2$ drawn independently from the Gibbs
distribution, conditioned on the event that they agree at v.



\[
\mathrm{AA}(v) = \frac{\sum_{t_1, t_2 : t_1(v) = t_2(v)} P(t_1) P(t_2) \, |t_1 \cap t_2|}{\sum_{t_1, t_2 : t_1(v) = t_2(v)} P(t_1) P(t_2)} . \tag{10}
\]



We estimate the numerator and denominator of $\mathrm{AA}(v)$
using the same heat-bath Gibbs sampler as for $\mathrm{MI}(v)$, except
that we sample independent pairs of classifications $(t_1, t_2)$
by starting the Markov chain at two independently random
initial states.
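
Given a collection of independently sampled pairs of classifications, the ratio defining the average agreement can be estimated by simple counting. The sketch below assumes each classification is a dictionary from nodes to labels over the same node set; the function name `average_agreement` and the choice to return `None` when no sampled pair agrees at v are conventions adopted here for illustration.

```python
def average_agreement(pairs, v):
    """Monte Carlo estimate of AA(v).

    `pairs` is a list of (t1, t2), two independently sampled
    classifications. Among the pairs that agree at v, return the mean
    number of nodes on which they agree, or None if no pair agrees at v."""
    total = 0
    count = 0
    for t1, t2 in pairs:
        if t1[v] == t2[v]:
            total += sum(1 for u in t1 if t1[u] == t2[u])  # numerator term
            count += 1                                      # denominator term
    return total / count if count else None
```

Since the pairs are drawn from the Gibbs distribution itself, the weights $P(t_1)P(t_2)$ in the numerator and denominator are accounted for by the sampling, and only the counts remain.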

5. RESULTS AND DISCUSSION 

We tested our algorithms on three different networks from 
three different fields. The first is Zachary's Karate Club [40].
As shown in Fig. 1, this is a social network consisting of 34
members of a karate club, where undirected edges represent




Figure 1: Zachary's Karate Club. 



friendships. The club split into two factions, indicated by 
diamonds and circles respectively. One of them centered 
around the instructor (node 1) and the other around the 
club president (node 34), each of which formed their own 
club. Shaded nodes are more peripheral, and have weaker 
ties to their communities. This network is highly assortative, 
with a high density of edges within each faction and a low 
density of edges between them. 

We judge the performance of each algorithm by asking, 
at each stage and for each node, with what probability the 
Gibbs distribution assigns it the correct label. In each stage 
we sampled the Gibbs distribution using 100 independently 
chosen initial conditions, doing $2 \times 10^4$ steps of the heat-bath
Markov chain for each one, and computing averages using
the last $10^4$ steps. Increasing the number of Markov chain
steps to 10^ per stage produced only marginal improvements 
in performance. Fig. 2 shows what fraction of the
unexplored nodes are assigned the correct label with probability
at least q, for various thresholds q = 0.1, 0.3, 0.5, 0.7, 0.9, as
a function of the stage j.
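
The performance measure just described can be stated compactly in code. In this sketch, `marginals[v][i]` is assumed to hold the estimated Gibbs probability that node v has label i, and `truth[v]` the correct label; both names, and the function name, are illustrative.

```python
def fraction_above_threshold(marginals, truth, unexplored, q):
    """Fraction of unexplored nodes whose correct label receives
    probability at least q under the estimated marginals."""
    if not unexplored:
        raise ValueError("no unexplored nodes left to evaluate")
    hits = sum(1 for v in unexplored if marginals[v][truth[v]] >= q)
    return hits / len(unexplored)
```

Evaluating this at several thresholds distinguishes nodes that are labeled correctly with confidence from those that are correct only marginally, and also reveals nodes that are labeled wrongly with high confidence.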

After exploring just four or five nodes, both of our algo- 
rithms succeed in correctly predicting the labels of most of 
the remaining nodes — i.e., to which faction they belong — 
with high accuracy. The AA algorithm performs slightly 
better than MI, achieving an accuracy close to 100% after 
exploring nine nodes. Of course, the Karate Club network 
is quite small, and there are many community-finding algo- 
rithms that classify the two factions with perfect or near- 
perfect accuracy [5^[T^ . 

Perhaps more interesting is the order in which our
algorithms choose to explore the nodes. In Fig. 3 we sort the
nodes in order of the median stage at which they are ex- 
plored. Error bars show 90% confidence intervals over 100 
independent runs of each algorithm. Some nodes show a 
large variance in the stage in which they are explored, while 
others are consistently explored at the beginning or end of 
the process. Both algorithms start by exploring nodes 1 and 
34, which are central to their respective communities. Note 
that these nodes are chosen, as we argued above, not just be- 
cause their labels are uncertain, but because they are highly 
correlated with the labels of other nodes. 

After learning that nodes 1 and 34 are in class 1 and 2 re- 
spectively, the algorithms "know" that the network consists 
of two assortative communities. They then explore nodes

Figure 2: Results of the active learning algorithms
on Zachary's Karate Club network.

Figure 3: The order in which the active learning
algorithms explore nodes in Zachary's Karate Club.



Figure 4: The order in which the active learning al- 
gorithm MI explores nodes in word adjacency net- 
work from the novel David Copperfield. 



such as 3, 9, and 10 which lie at the boundary between 
these communities. Once the boundary is clear, they can 
easily predict the labels of the remaining nodes. The last 
nodes to be explored are those such as 2, 4, and 24, which 
lie so deep inside their communities that their labels are not 
in doubt. 

The second network consists of the 60 most commonly oc- 
curring nouns and the 60 most commonly occurring adjec- 
tives in Charles Dickens' novel David Copperfield. A directed 
edge connects any pair of words that appear adjacently in 
the text, pointing from the preceding word to the following 
one. Excluding eight words which are disconnected from the 
rest leaves a network with 112 nodes [29]. Unlike Zachary's
Karate Club, this network is both directed and highly dis- 
assortative. Of the 1494 edges, 1123 of them point from 
adjectives to nouns. This lets us classify most nodes early 
on, simply by labeling a node as an adjective or noun if its 
out-degree or in-degree is large. 

Accordingly, our algorithms focus their attention on words 
about which they are uncertain, like "early," "low," and "noth- 
ing," whose out-degrees and in-degrees in the text are roughly 
equal, and words like "perfect" that precede words of both 
classes (see Fig. 4, where green and yellow nodes represent
nouns and adjectives respectively; rectangular nodes are ex- 
plored first, and elliptical ones last). Once these nodes are 
resolved, both algorithms achieve high accuracy: 80%
accuracy after exploring 20 nodes and close to 100% after
exploring 65 nodes (see Fig. 5).

In each stage we sampled the Gibbs distribution using 100 
independently chosen initial conditions, doing $5 \times 10^4$ steps
of the heat-bath Markov chain for each one, and computing
averages using the last $2.5 \times 10^4$ steps. Increasing the
number of Markov chain steps to 10^ per stage produced 
only marginal improvements in performance. As in Fig. 2,
the y-axis shows the fraction of unexplored nodes which are 
labeled correctly by the conditional Gibbs distribution with 
probability at least q, for q = 0.1, 0.3, 0.5, 0.7, 0.9. The per- 
formance of the two algorithms is similar in the later stages, 
but unlike the Karate Club, here MI performs noticeably 
better than AA in the early stages. 

The third network is a food web of 488 species in the 
Weddell Sea in the Antarctic [TTl |9l |22] , with edges pointing 
Figure 5: Results of the active learning algorithms on word adjacency network in the novel David Copperfield
by Charles Dickens.




Figure 6: Results for the Weddell Sea food web. 



to each predator from its prey. This data set is very rich, 
but we focus on two particular variables — the feeding type 
and the habitat in which the species lives. The feeding type
takes k = 6 values, namely primary producer, omnivorous,
herbivorous/detrivorous, carnivorous, detrivorous, and
carnivorous/necrovorous. The habitat variable takes k = 5
values, namely pelagic, benthic, benthopelagic, demersal, and
land-based.

We show results of our algorithms for both variables in 
Fig. 6. The results are averaged over 100 runs of each
algorithm. In each stage we sampled the Gibbs distribution
using 100 independently chosen initial conditions, doing $5 \times 10^4$
steps of the heat-bath Markov chain for each one, and
computing averages using the last $2.5 \times 10^4$ steps. For the
feeding type, after exploring half the nodes, both algorithms
correctly label about 75% of the remaining nodes. For the 
habitat variable, both algorithms are less accurate, although 
AA performs somewhat better than MI. Note that the
accuracy only includes the unexplored nodes, not the nodes
we have already explored. Thus it can decrease if we ex- 
plore easily-classified nodes early on, so that hard-to-classify 
nodes form a larger fraction of the remaining ones. 

Fig. 6 shows that both algorithms get to a state where
they are confident, but wrong, about many of the unexplored 
nodes. For the feeding type variable, for instance, after the 
AA algorithm has explored 300 species, it labels 75% of the 
remaining nodes correctly with probability 90%, but it labels 
the other 25% correctly with probability less than 10%. In 
other words, it has a high degree of confidence about all 
the nodes, but is wrong about many of them. Its accuracy 
improves as it explores more nodes, but it doesn't achieve 
high accuracy on all the unexplored nodes until there are 
only about 60 of them left. 

Why is this? We argue that the fault lies, not with our 
learning algorithms and the order in which they explore the 
nodes, but with the stochastic block model and its abil- 
ity to model the data. For example, for the habitat vari- 
able, these algorithms perform well on pelagic, demersal, 
and land-based species. But the benthic habitat, which is 
the largest and most diverse, includes species with many 
feeding types and trophic levels. 

These additional variables have a large effect on the topol- 
ogy, but they are not taken into account by the block model. 
As a result, more than half the benthic species are mislabeled 
by the block model in the following sense: even if we con- 
dition on the correct habitats of all the other species, the 
species' most likely habitat is pelagic, benthopelagic, dem- 
ersal, or land-based. Specifically, 219 of the 488 species are 
mislabeled by the most likely block model, 94% of them with 
confidence over 0.9. 

Of course, we can also regard our algorithms' mistakes as 
evidence that these habitat classifications are not cut and 
dried. Indeed, ecologists recognize that there are "connector 
species" that connect one habitat to another, and belong to 
some extent to both. 

To test our hypothesis that it is the block model's inability 
to model the data that causes some nodes to be misclassi- 
fied, we artificially modified the data set to make it consis- 
tent with the block model. Starting with the nodes' original 
class labels, we updated the habitat of each species to its 
most likely value according to the block model, given the 
habitats of all the other species. After iterating this process 
six times, we reached a fixed point where each species'
habitat is consistent with the block model's predictions. On this
synthetic data set both of our learning algorithms perform 
perfectly, predicting the habitat of every species with close 
to 100% accuracy after exploring just 18% of them. 
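
The relabeling procedure used to build this synthetic data set can be sketched as a sweep that repeats until no label changes. The callback `most_likely_label(v, labels)` is a hypothetical stand-in for the block model's most likely label of v given all the other labels. Whether the updates are applied one node at a time, as here, or simultaneously is a choice made for this illustration, as is the cap `max_sweeps`.

```python
def relabel_to_fixed_point(labels, most_likely_label, max_sweeps=100):
    """Repeatedly replace each node's label by its most likely value given
    the others, until a full sweep changes nothing.

    Returns the final labels and the number of sweeps performed, counting
    the last sweep in which nothing changed."""
    labels = dict(labels)  # work on a copy; the input is left untouched
    for sweep in range(1, max_sweeps + 1):
        changed = False
        for v in list(labels):
            new = most_likely_label(v, labels)
            if new != labels[v]:
                labels[v] = new
                changed = True
        if not changed:
            return labels, sweep
    return labels, max_sweeps
```

At the fixed point every label agrees with the model's own prediction, so any remaining errors of a learning algorithm on such data cannot be blamed on a mismatch between the model and the labels.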

More generally, it is important to remember that the topol- 
ogy of the network is only imperfectly correlated with the 
nodes' types. Zachary [40] relates that one of the members of
the Karate Club joined the instructor's faction even though 
the network's topology suggests that he was more strongly 
connected to the president. The reason is that he was only 
three weeks away from a test for his black belt when the 
split occurred. He had already invested four years learning 
the instructor's style of karate, and if he had joined the pres- 
ident's club he would have had to start over with a white 
belt. In any real-world network, there is information of this 
kind that is not reflected in the topology and which is
hidden from our algorithm. If a node is of a given class for
idiosyncratic reasons like these, we cannot expect any algo- 
rithm based solely on topology and the other nodes' class 
labels — no matter how sophisticated a probabilistic model 
we use — to correctly classify it. 

6. COMPARISON WITH SIMPLE 
HEURISTICS 

We compared our active learning algorithms with several 
simple heuristics. These include exploring the node with the 
highest degree in the subgraph of unexplored nodes, explor- 
ing the node with the highest betweenness centrality (the 
fraction of shortest paths that go through it, see [8, 27, 28])
in the subgraph of unexplored nodes, and exploring a node 
chosen uniformly at random from the unexplored ones. We 
judge the performance of these heuristics using the same 
Gibbs sampling process as for MI and AA. 
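
Two of these baselines are simple enough to state in a few lines. The sketch below covers the highest-degree and random-node heuristics only; it counts both incoming and outgoing edges toward the degree and breaks ties by list order, which are assumptions made here, and the function names are illustrative. The betweenness heuristic is omitted because it requires a shortest-path computation.

```python
import random

def highest_degree_node(edges, unexplored):
    """Unexplored node with the largest degree in the subgraph induced by
    the unexplored nodes (in-degree plus out-degree; ties go to the node
    appearing first in `unexplored`)."""
    pool = set(unexplored)
    degree = {v: 0 for v in unexplored}
    for u, v in edges:
        if u in pool and v in pool:
            degree[u] += 1
            degree[v] += 1
    return max(unexplored, key=lambda v: degree[v])

def random_node(unexplored, rng):
    """Unexplored node chosen uniformly at random."""
    return rng.choice(unexplored)
```

Note that the degree is recomputed on the unexplored subgraph at every stage, so a node that was a hub in the full network may no longer be chosen once its neighbors have been explored.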

In Fig. 7 we show the results of these heuristics at the
0.9 accuracy threshold on all three networks, including both 
the habitat and feeding type variables in the food web. 
On Zachary's Karate Club (left) our algorithms outperform 
these heuristics consistently. In the David Copperfield net- 
work (right), the highest-degree and highest-betweenness 
heuristics enjoy an early lead, but quickly hit a ceiling and 
are surpassed by MI and AA. 

For the Weddell Sea food web (bottom), the highest-degree 
and highest-betweenness heuristics perform poorly through- 
out the learning process. One reason for this is that many 
nodes with high degree or high betweenness are easy to clas- 
sify from the labels of their neighbors. By exploring these 
nodes first, these heuristics leave themselves mainly with 
hard-to-classify nodes. The random-node heuristic performs 
surprisingly well early on, but all three heuristics are worse 
than MI or AA once they have explored half the nodes. 

7. CONCLUSION 

Active learning, using mutual information or average agree- 
ment coupled with a generative model, offers a new approach 
to analyzing networks where the topology is known, but 
knowledge of class labels is incomplete and costly to obtain. 
We have shown for three networks, one social, one lexical, 
and one biological, that our algorithms do a good job of 
predicting the labels of unexplored nodes after exploring a 
relatively small fraction of the network, correctly recognizing 
both assortative and disassortative functional communities. 
Certainly not all networks are well-described by the simple 
Figure 7: A comparison of the MI and AA learning algorithms with three simple heuristics.



block model we use here, but our approach can be general- 
ized to probabilistic network models which take information 
on the nodes' locations or degrees into account. 